WWPI Issue Archive: Computer Technology News - May 2002

Eight Considerations for Selecting An Effective Content Categorization Solution

By Parag Patkar

As corporations struggle with the problem of organizing content for efficient and effective retrieval, many software solutions have surfaced in the last few years. Categorization software is one such solution that holds promise.

Information categorization and retrieval software allows organizations to categorize content into directory trees or taxonomies much like an Electronic Library, Yahoo!, or the Open Directory Project (used by Google). These directories of neatly categorized data can be browsed and queried by end users to search for relevant content. Some industry analysts compare what categorization software does for unstructured content to what relational databases do for structured content.

Search engines and document management systems dominated the landscape in the early days of knowledge management. Search engines were and still are very effective in indexing and querying large amounts of data that typically reside on public domain sites serving a large number of users with vastly varying needs.

On the other hand, document management systems became popular in organizations with emphasis on structure and processes where the need for mapping content to business processes and strict adherence to business rules was of prime importance. The problem of combining these two systems has arisen as more publicly held content is now used in business processes and vice versa.

Categorization software is a viable solution in environments that mix the two situations above: content is hierarchical, does not necessarily map to a specific business process, and is too voluminous to manage manually, yet is not as dispersed or unrelated as public-domain content. The value of organizing such content is clear, but the map for choosing an effective solution is still muddy. Furthermore, selecting a suitable vendor is complicated by the large number of categorization software vendors to choose from: Quiver, Semio, Autonomy, and about two dozen others.

When selecting an auto-categorization tool for an enterprise or departmental project, consider the following eight features: taxonomy generation, training, auto-categorization, real-time publishing, human oversight, personalization, API and integration, and technical specifications.

1. Taxonomy Generation
First and foremost, is there already a neatly categorized repository of data or topic hierarchy that could migrate over to the auto-categorization tool, or is it necessary to start from scratch?

If it’s the former, then look for software tools that will let you import an existing taxonomy that may have been built with painstaking precision. Most software packages allow importing taxonomies in one form or other, but always check this functionality to ensure a seamless transition. If a taxonomy already exists, then it is also likely that processes are in place for subject matter experts and editors to continuously monitor the creation of new end user topics. Consequently, make sure to identify an intuitive interface for these workflow processes.

If starting from scratch with no staff of well-trained editors, it may be better to begin with a taxonomy assistant. These tools generally recommend taxonomy topics by identifying relevant clusters in unsorted or unclassified documents. If working in a specific vertical industry, a canned taxonomy may be available in the public domain, or it may be possible to purchase one through a fee-based service. In any case, software that identifies data clusters and recommends a taxonomy is useful not only for kickstarting the process but also for ongoing maintenance. However, having the tools to pick and choose which topics to implement or discard is paramount, because in many cases auto-taxonomy generation fails to suggest a high volume of meaningful taxonomy topics.
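As a rough illustration of what a taxonomy assistant does, the sketch below groups documents by shared vocabulary and proposes a label for each group. It is a minimal sketch under stated assumptions, not how any particular vendor works: the greedy single-pass clustering, the Jaccard similarity measure, the tiny stopword list, and the names `suggest_topics`, `threshold`, and `label_terms` are all hypothetical choices made for the example.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"the", "a", "an", "of", "for", "and", "to", "in", "on", "is", "with"}


def terms(text):
    """Lowercase a document and return its set of non-stopword terms."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity between two term sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_topics(docs, threshold=0.2, label_terms=2):
    """Greedy single-pass clustering; returns (suggested_label, docs) pairs."""
    clusters = []  # each cluster: {"terms": set of terms, "docs": list of texts}
    for doc in docs:
        doc_terms = terms(doc)
        # Find the existing cluster whose vocabulary overlaps this document most.
        best = max(clusters, key=lambda c: jaccard(doc_terms, c["terms"]), default=None)
        if best is not None and jaccard(doc_terms, best["terms"]) >= threshold:
            best["docs"].append(doc)
            best["terms"] |= doc_terms
        else:
            clusters.append({"terms": set(doc_terms), "docs": [doc]})

    suggestions = []
    for cluster in clusters:
        counts = Counter(t for d in cluster["docs"] for t in terms(d))
        # Most frequent terms first; ties broken alphabetically for stable output.
        top = sorted(counts, key=lambda t: (-counts[t], t))[:label_terms]
        suggestions.append((" / ".join(top), cluster["docs"]))
    return suggestions
```

The suggested labels are only candidates. An editor would still accept, rename, or discard each one, which is the pick-and-choose step described above.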

2. Training
This is the area that involves defining “intelligent” taxonomies, so that the automated component of the software can more accurately categorize incoming documents into the relevant topics. There are two popular approaches. The first discovers a hierarchy in a cluster of unsorted documents and populates the topics accordingly (unsupervised learning). Such systems sometimes also include a thesaurus, a map of linked concepts based on natural language techniques, and/or a collection of standard terms and relationships for specific topics. These systems are quicker to implement but less accurate when categorizing data that falls outside the scope of the thesaurus or the map. They also work best when terms and concepts are finite and well defined.

The second type of system uses manual training processes, also called supervised learning, to train the taxonomy. Training in these systems happens by associating training documents with a topic or, in simpler systems, associating keywords and metadata-based business rules with the topic. These training documents or keywords essentially define each topic and its key features within the taxonomy.
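The simpler, keyword-based form of training can be sketched in a few lines. This is an illustrative sketch only: the topic names, the keyword sets, and the `categorize_by_keywords` function with its `min_hits` parameter are hypothetical, and real products layer metadata rules and weighting on top of this idea.

```python
def categorize_by_keywords(text, topic_keywords, min_hits=1):
    """Return topics whose keywords appear in the text, best match first."""
    words = set(text.lower().split())
    # Count how many of each topic's keywords occur in the document.
    hits = {topic: len(words & keywords) for topic, keywords in topic_keywords.items()}
    matched = [topic for topic, n in hits.items() if n >= min_hits]
    # Strongest match first; ties broken alphabetically.
    return sorted(matched, key=lambda t: (-hits[t], t))


# Hypothetical keyword rules that "define" two topics.
RULES = {
    "Finance": {"revenue", "earnings", "invoice"},
    "IT Operations": {"server", "outage", "backup"},
}

print(categorize_by_keywords("Quarterly earnings call delayed by server outage", RULES))
# prints ['IT Operations', 'Finance']
```

A document can land in more than one topic here, which matches how directory-style taxonomies usually behave.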

In an ideal scenario, you should be able to manually tweak an automatically proposed taxonomy via training sets, keywords, and other rules-based systems. In another year or so, most systems will be incorporating each of these approaches. But for now, make sure to pick one that will best adapt to the coming convergence.

3. Auto-Categorization
Most auto-categorization processes are based on mathematical models, the same kinds that are used in search engines. Naïve Bayes, Support Vector Machines, and Neural Networks are just some of the popular ones used by numerous vendors. Results from most of these software solutions seem strikingly similar, probably because most vendors use similar algorithms.
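To make one of these algorithms concrete, here is a minimal sketch of a multinomial Naïve Bayes categorizer with add-one smoothing, trained on example documents per topic. The class name `NaiveBayesCategorizer`, its methods, and the whitespace tokenization are assumptions made for illustration; commercial products use far richer feature extraction and are not necessarily built this way.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesCategorizer:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word -> count
        self.doc_counts = Counter()              # topic -> number of training docs
        self.vocab = set()                       # every word seen in training

    def train(self, topic, text):
        """Associate one training document with a topic."""
        words = text.lower().split()
        self.word_counts[topic].update(words)
        self.doc_counts[topic] += 1
        self.vocab.update(words)

    def scores(self, text):
        """Return a log-probability score per topic (higher is better)."""
        words = [w for w in text.lower().split() if w in self.vocab]
        total_docs = sum(self.doc_counts.values())
        result = {}
        for topic, n_docs in self.doc_counts.items():
            counts = self.word_counts[topic]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # prior from training volume
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            result[topic] = score
        return result

    def categorize(self, text):
        """Return the single best-scoring topic for a document."""
        topic_scores = self.scores(text)
        return max(topic_scores, key=topic_scores.get)
```

Usage follows the supervised pattern: call `train(topic, text)` once per training document, then `categorize(text)` on incoming documents. Words never seen in training are ignored rather than guessed at.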

To compare candidate solutions, measure the recall and precision values for each. Definitions for precision and recall are easily available on the Web.

“Recall” is the percentage of all documents of interest that the system actually retrieves.

“Precision” is the percentage of retrieved documents that are actually of interest.
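Under the standard information-retrieval definitions, precision is the share of the documents assigned to a topic that really belong there, and recall is the share of the documents that belong there that were actually assigned. The sketch below computes both for a single topic; the function name `precision_recall` and the document IDs are hypothetical, and the "relevant" set is assumed to come from an editor's judgment.

```python
def precision_recall(assigned, relevant):
    """Compute (precision, recall) for one topic.

    assigned: set of document IDs the categorizer placed in the topic
    relevant: set of document IDs an editor says belong in the topic
    """
    true_positives = len(assigned & relevant)
    precision = true_positives / len(assigned) if assigned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 5 documents assigned, 6 truly relevant, 3 in common.
assigned = {"d1", "d2", "d3", "d4", "d5"}
relevant = {"d1", "d2", "d3", "d6", "d7", "d8"}
print(precision_recall(assigned, relevant))  # prints (0.6, 0.5)
```

Running the same measurement over the same test set for each candidate product gives a like-for-like basis for comparison.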

Depending on the dataset you work with, the automated portion of the categorization process should result in precision and recall values between 40% and 60%. Consistent precision and recall values above 80% across a wide range of datasets are as yet unheard of, but achieving them would go a long way toward ensuring wider recognition for this space. The bottom line is that auto-categorization is an essential component but will not by itself provide the accuracy needed to successfully organize enterprise content. Making sure the solution selected offers visibility into and control over the auto-categorization process will ensure the best possible resulting directory.

4. Real-Time Publishing
Most software packages batch the categorization process because of the kind of mathematical operations involved. Organizations that handle news and financial information, however, need a stream of documents categorized in real time. Also look for other real-time publishing capabilities, such as pushing mission-critical documents (product recalls or important breaking news, for example) out to end users immediately.

5. Human Oversight
Until algorithms are refined to achieve over 80% for precision and recall, this is a key distinguishing component among categorization solutions. Specific features to consider in this category include:

The ability to create and modify the taxonomy structure and its content, including editing metadata associated with topics and documents and changing topic-document associations both pre- and post-categorization.
Degree of control over the categorization process, specifically tools to manually improve the training process and switch the algorithms used for training and categorization. Any handles to the heavily mathematical categorization process are also nice to have. Most vendors have black-boxed the mathematical part of the process, which keeps things simple. However, it is doubtful that the next generation of these systems can keep these algorithms boxed in.
Retention of editorial actions or learning by example. Editorial action performed on the taxonomy should not only be retained but should also be able to override the automated component of the categorization process. For example, documents manually deleted from a topic should never be published into the topic again unless the document changes substantially.
Version control or the ability to store and retrieve multiple versions of the taxonomy.
Configurable alerts within the system to notify editors of new or flagged content to review.
The ability to search documents within different versions of taxonomies and underlying topics and make use of existing intelligence built within the search index for purposes of categorization.
Integration with content/document management systems and search indexes, the repositories of unstructured content that can be used in the categorization process.
Personalized editorial access to the taxonomy. This is the ability for multiple editors to take control of specific portions of the taxonomy but still be able to interact and share knowledge with other editors.
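The retention of editorial actions described in the list above can be sketched as a filter that sits between the automated categorizer and publication. This is a minimal sketch under assumptions: the `EditorialOverrides` class, the `publish` helper, and the word-overlap test for "changes substantially" (with its `change_threshold` parameter) are hypothetical stand-ins for whatever a given product actually does.

```python
def word_set(content):
    """Crude content fingerprint: the set of lowercase words in the document."""
    return frozenset(content.lower().split())


class EditorialOverrides:
    """Remembers manual rejections so the automated step cannot undo them."""

    def __init__(self, change_threshold=0.5):
        # A rejected document is reconsidered only when its word overlap with
        # the rejected version falls below this (assumed) threshold.
        self.change_threshold = change_threshold
        self._rejected = {}  # (topic, doc_id) -> word set at rejection time

    def reject(self, topic, doc_id, content):
        """Record that an editor removed this document from this topic."""
        self._rejected[(topic, doc_id)] = word_set(content)

    def allows(self, topic, doc_id, content):
        """True if the automated categorizer may publish doc_id into topic."""
        old = self._rejected.get((topic, doc_id))
        if old is None:
            return True  # no editor has ever rejected this pairing
        new = word_set(content)
        union = old | new
        overlap = len(old & new) / len(union) if union else 1.0
        return overlap < self.change_threshold


def publish(auto_assignments, overrides):
    """Keep only the (topic, doc_id) pairs that editors have not vetoed."""
    return [(t, d) for t, d, content in auto_assignments if overrides.allows(t, d, content)]
```

The key design point is the direction of authority: the categorizer proposes, and the stored editorial decision wins unless the document itself has changed enough to merit a fresh look.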

6. Personalization
It is also important to ensure that the look and feel of the categorization output is flexible and integrates well with user specifications. For example, the solution should allow personalization of search results and the taxonomy, including topic- and document-level access control, the ability to drill down and search by topic, and the display of documents, topics, and any associated metadata. It should also make it easy for end users to interact with the system.

7. API and Integration
It is imperative that your auto-categorization solution works with existing IT investments, such as search engines, portals, and content management systems. It should integrate end-user output with portals and other Web-based applications, be easy to deploy over Web servers and app servers, and feed metadata such as keywords back to documents or the underlying content management system.

8. Technical Specifications
Apart from standard features like modular architecture and ease of installation and maintenance, other features to consider include support for multiple software and hardware platforms (chipset, OS, database, Web server, search, network configuration) and the supported document types.

Finally, another important detail to take into account is the cost to implement your solution and the type of return on investment to expect. Typically, a good piece of software on some reliable hardware should cost between $100,000 and $150,000 for a departmental implementation involving 200,000 to 300,000 documents and about 1,000 topics. Consulting fees depend on specific needs and can range from $5,000 to $30,000 if an editorial team is well trained, to about $50,000 to $100,000 if extensive hand-holding is needed with regard to production-quality data and taxonomy. In terms of ROI, many vendors currently offer ROI calculators on their websites. Make sure the selected vendor can show a true ROI specific to the project involved.
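Adding the figures above gives a rough total budget range for each consulting scenario. The short sketch below only sums the ranges quoted in this article; the variable and function names are hypothetical, and it makes no attempt to model ROI.

```python
# Ranges quoted above, in US dollars (low, high).
SOFTWARE_AND_HARDWARE = (100_000, 150_000)
CONSULTING = {
    "well-trained editorial team": (5_000, 30_000),
    "extensive hand-holding": (50_000, 100_000),
}


def total_range(base, extra):
    """Add two (low, high) cost ranges end to end."""
    return (base[0] + extra[0], base[1] + extra[1])


for scenario, fees in CONSULTING.items():
    low, high = total_range(SOFTWARE_AND_HARDWARE, fees)
    print(f"{scenario}: ${low:,} to ${high:,}")
# prints:
# well-trained editorial team: $105,000 to $180,000
# extensive hand-holding: $150,000 to $250,000
```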

For current users of categorization software, hopefully this has reinforced the choice of software already made. For those yet to identify a categorization solution for the glut of unstructured content in the enterprise, these eight considerations can help formulate a set of selection criteria that will ensure a successful implementation. Just remember, a taxonomy is an ever-changing structure that is specific to each business. That being the case, the right categorization solution should be flexible and reflect each business's processes and information.

Parag Patkar is the director of professional services at Quiver, Inc. (San Mateo, CA).